source('../env.R')

Select species

These are all the species currently available from the eBird Status and Trends data product Sullivan and Kelling. (2009). https://science.ebird.org/en/status-and-trends/species#columb2

species_available = c("Streptopelia decaocto",
"Streptopelia semitorquata",
"Streptopelia capicola",
"Streptopelia tranquebarica",
"Spilopelia chinensis",
"Macropygia phasianella",
"Chalcophaps indica",
"Ocyphaps lophotes",
"Geopelia humeralis",
"Columbina inca",
"Columbina passerina",
"Columbina minuta",
"Columbina talpacoti",
"Columbina buckleyi",
"Columbina squammata",
"Columbina picui",
"Claravis pretiosa",
"Geotrygon montana",
"Leptotila verreauxi",
"Leptotila jamaicensis",
"Leptotila cassinii",
"Leptotila plumbeiceps",
"Zentrygon costaricensis",
"Zentrygon lawrencii",
"Zentrygon albifacies",
"Zentrygon chiriquensis",
"Zenaida meloda",
"Zenaida asiatica",
"Zenaida aurita",
"Zenaida auriculata",
"Zenaida macroura",
"Treron affinis",
"Treron phoenicopterus",
"Treron sieboldii",
"Treron formosae",
"Ducula aenea",
"Hemiphaga novaeseelandiae",
"Columba livia",
"Columba guinea",
"Columba oenas",
"Columba palumbus",
"Patagioenas cayennensis",
"Patagioenas speciosa",
"Patagioenas squamosa",
"Patagioenas picazuro",
"Patagioenas corensis",
"Patagioenas maculosa",
"Patagioenas leucocephala",
"Patagioenas flavirostris",
"Patagioenas fasciata",
"Patagioenas araucana",
"Patagioenas subvinacea",
"Patagioenas nigrirostris",
"Streptopelia turtur",
"Streptopelia orientalis")

length(species_available)
[1] 55

Filter against library list, as not everything on website is availble in library

species_available = ebirdst_runs$scientific_name[ebirdst_runs$scientific_name %in% species_available]
species_available
 [1] "Geopelia humeralis"         "Patagioenas fasciata"       "Macropygia phasianella"     "Patagioenas araucana"       "Columbina passerina"       
 [6] "Columba palumbus"           "Ocyphaps lophotes"          "Zenaida auriculata"         "Streptopelia decaocto"      "Columbina inca"            
[11] "Zenaida macroura"           "Hemiphaga novaeseelandiae"  "Streptopelia orientalis"    "Patagioenas picazuro"       "Columbina picui"           
[16] "Patagioenas flavirostris"   "Streptopelia tranquebarica" "Streptopelia semitorquata"  "Streptopelia capicola"      "Columba livia"             
[21] "Columbina talpacoti"        "Patagioenas squamosa"       "Patagioenas nigrirostris"   "Columba guinea"             "Spilopelia chinensis"      
[26] "Zenaida meloda"             "Patagioenas leucocephala"   "Leptotila verreauxi"        "Zenaida asiatica"           "Treron phoenicopterus"     
[31] "Zenaida aurita"            

First check that is no knarly taxonomic mapping required for any of these species:

taxonmomic_mapping = read_csv(filename(TAXONOMY_OUTPUT_DIR, 'taxonomy_mapping.csv'))
Rows: 438 Columns: 7── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): species_name, birdlife_common_name, ebird_common_name, ebird_species_name, jetz_species_name
dbl (2): birdlife_id, ebird_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Species with 1 eBird taxa, and multiple Birdlife taxa will reuqire the ranges of the Birdlife taxa to be combined.

taxonmomic_mapping[taxonmomic_mapping$ebird_species_name %in% species_available,] %>%
  group_by(ebird_species_name) %>% 
  summarise(number_of_birdlife_species = n(), birdlife_species_names = paste(species_name, collapse = ",")) %>%
  filter(number_of_birdlife_species > 1)
  • Patagioenas fasciata, will need the range of Patagioenas fasciata and Patagioenas albilinea combined.

  • Spilopelia chinensis, will need the range of Spilopelia chinensis, Spilopelia suratensis, and Patagioenas albipennis combined.

Check whether we need to map eBird species names to Birdlife

taxonmomic_mapping[taxonmomic_mapping$ebird_species_name %in% species_available,] %>%
  filter(ebird_species_name != species_name) 

Species with 1 Birdlife taxa, and multiple eBird taxa will, would mean that we would have cities with no records but potentially have presence test data. We should avoid these species.

taxonmomic_mapping[taxonmomic_mapping$ebird_species_name %in% species_available,] %>%
  group_by(species_name) %>% 
  summarise(number_of_ebird_species = n(), ebird_species_names = paste(ebird_species_name, collapse = ",")) %>%
  filter(number_of_ebird_species > 1)
test_species = species_available

Create regional pool data

regional_pool_data_input = read_csv(filename(BIRDLIFE_WORKING_OUTPUT_DIR, 'urban_distributions.csv'))
Rows: 7215 Columns: 6── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): species_name, city_name, origin, seasonal, presence
dbl (1): city_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Join ranges up

regional_pool_data_input$species_name[regional_pool_data_input$species_name %in% c('Spilopelia suratensis', 'Patagioenas albipennis')] = 'Spilopelia chinensis'
regional_pool_data_input$species_name[regional_pool_data_input$species_name %in% c('Patagioenas albilinea')] = 'Patagioenas fasciata'

regional_pool_data = regional_pool_data_input %>% group_by(species_name, city_id, city_name) %>% summarise() %>% filter(species_name %in% test_species)
`summarise()` has grouped output by 'species_name', 'city_id'. You can override using the `.groups` argument.
regional_pool_data[regional_pool_data$city_name == 'Medellín',]

Create city to pigeon data

Get number of checklists and total effort in each city

urban_checklists = read_csv(filename(EBIRD_WORKING_OUTPUT_DIR, 'urban_checklists.csv'))
Rows: 3542050 Columns: 6── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): checklist_id, locality_id, city_name
dbl  (2): city_id, duration
date (1): obs_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
urban_summary = urban_checklists %>% group_by(city_id, city_name) %>% summarise(total_city_checklists = n(), total_city_effort = sum(duration))
`summarise()` has grouped output by 'city_id'. You can override using the `.groups` argument.
urban_summary

For each of our test species, find the number of checklists and total effort in each city

urban_columbidae_records_raw = read_csv(filename(EBIRD_WORKING_OUTPUT_DIR, 'urban_columbidae.csv'))
Rows: 2593785 Columns: 8── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): observation_count, checklist_id, city_name, species_name, common_name
dbl (3): number_observers, city_id, duration
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
urban_columbidae_records_raw[urban_columbidae_records_raw$city_name == 'Manchester',]

Join the total city data to each pigeon record

urban_columbidae_records = left_join(urban_columbidae_summary, urban_summary)
Joining, by = c("city_id", "city_name")
head(urban_columbidae_records)

We also want records for each city that could have that pigeon (e.g. in regional pool range) but has no eBird records

urban_columbidae_absence_records_part1 = left_join(regional_pool_data, urban_summary)
Joining, by = c("city_id", "city_name")

urban_columbidae_absence_records_part1$total_presence_checklists = 0
urban_columbidae_absence_records_part1$total_presence_effort = 0

urban_columbidae_absence_records = urban_columbidae_absence_records_part1[!is.na(urban_columbidae_absence_records_part1$total_city_checklists),]
urban_columbidae_absence_records
all_columbidae_around_cities = rbind(urban_columbidae_records, urban_columbidae_absence_records) %>%
  group_by(city_id, city_name, species_name, total_city_checklists, total_city_effort) %>%
  summarise(total_presence_checklists = sum(total_presence_checklists), total_presence_effort = sum(total_presence_effort)) %>%
  arrange(city_id, species_name)
`summarise()` has grouped output by 'city_id', 'city_name', 'species_name', 'total_city_checklists'. You can override using the `.groups` argument.

all_columbidae_around_cities$checklist_ratio = all_columbidae_around_cities$total_presence_checklists / all_columbidae_around_cities$total_city_checklists
all_columbidae_around_cities$effort_ratio = all_columbidae_around_cities$total_presence_effort / all_columbidae_around_cities$total_city_effort

all_columbidae_around_cities

Finally join our our test data result to show predicted abundance, remove any NA abundance as there is insufficent data in these places for a prediction.

Plot to see Expected test abundance vs checklists

n = ((Z * Z) * p * (1−p)) / (E * E)

(e.g., 99% corresponds to Z≈2.576Z≈2.576).

So for present = more than 5% of checklists.

p is the estimated probability of success (the proportion of checklists where the species is present). E is the margin of error (half the width of the confidence interval).

(2.576^2 * 0.95 * 0.05) / 0.01^2
[1] 3151.994
plot_city_species = function(city_id) {
  city_name = urban_summary$city_name[urban_summary$city_id == city_id]
  total_checklists = urban_summary$total_city_checklists[urban_checklists$city_id == city_id]
  
  ggplot(columbidae[columbidae$city_id == city_id,], aes(x = species_name, y = checklist_ratio * 100, fill = expected_present)) + 
    geom_bar(stat = "identity") + 
    geom_text(aes(label = round(max_abundance, 2))) + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
    xlab('Species name') + ylab('% of Checklist present') + 
    facet_wrap(~city_id) + 
    geom_hline(yintercept=5, linetype='dotted', col = 'red') +
    labs(title = paste(city_name, total_checklists))
}
plot_city_species(1)

plot_city_species(10)

plot_city_species(14)

plot_city_species(21)

plot_city_species(155)

plot_city_species(7932)

plot_city_species(4328)

plot_city_species(1919)

Sullivan, C. L. Wood, B. L., and S. Kelling. 2009. “eBird: A Citizen-Based Bird Observation Network in the Biological Sciences.” 142 (10): 2282–92. https://doi.org/10.1016/j.biocon.2009.05.006.
---
title: "Generate presence test data using eBird Status & Trends"
output: html_notebook
bibliography: ../ref.bib 
---

```{r}
source('../env.R')
```

# Select species
These are all the species currently available from the eBird Status and Trends data product @eBird2023.
https://science.ebird.org/en/status-and-trends/species#columb2


```{r}
species_available = c("Streptopelia decaocto",
"Streptopelia semitorquata",
"Streptopelia capicola",
"Streptopelia tranquebarica",
"Spilopelia chinensis",
"Macropygia phasianella",
"Chalcophaps indica",
"Ocyphaps lophotes",
"Geopelia humeralis",
"Columbina inca",
"Columbina passerina",
"Columbina minuta",
"Columbina talpacoti",
"Columbina buckleyi",
"Columbina squammata",
"Columbina picui",
"Claravis pretiosa",
"Geotrygon montana",
"Leptotila verreauxi",
"Leptotila jamaicensis",
"Leptotila cassinii",
"Leptotila plumbeiceps",
"Zentrygon costaricensis",
"Zentrygon lawrencii",
"Zentrygon albifacies",
"Zentrygon chiriquensis",
"Zenaida meloda",
"Zenaida asiatica",
"Zenaida aurita",
"Zenaida auriculata",
"Zenaida macroura",
"Treron affinis",
"Treron phoenicopterus",
"Treron sieboldii",
"Treron formosae",
"Ducula aenea",
"Hemiphaga novaeseelandiae",
"Columba livia",
"Columba guinea",
"Columba oenas",
"Columba palumbus",
"Patagioenas cayennensis",
"Patagioenas speciosa",
"Patagioenas squamosa",
"Patagioenas picazuro",
"Patagioenas corensis",
"Patagioenas maculosa",
"Patagioenas leucocephala",
"Patagioenas flavirostris",
"Patagioenas fasciata",
"Patagioenas araucana",
"Patagioenas subvinacea",
"Patagioenas nigrirostris",
"Streptopelia turtur",
"Streptopelia orientalis")

length(species_available)
```

Filter against library list, as not everything on website is availble in library
```{r}
species_available = ebirdst_runs$scientific_name[ebirdst_runs$scientific_name %in% species_available]
species_available
```

## First check that is no knarly taxonomic mapping required for any of these species:
```{r}
taxonmomic_mapping = read_csv(filename(TAXONOMY_OUTPUT_DIR, 'taxonomy_mapping.csv'))
```

Species with 1 eBird taxa, and multiple Birdlife taxa will reuqire the ranges of the Birdlife taxa to be combined. 
```{r}
taxonmomic_mapping[taxonmomic_mapping$ebird_species_name %in% species_available,] %>%
  group_by(ebird_species_name) %>% 
  summarise(number_of_birdlife_species = n(), birdlife_species_names = paste(species_name, collapse = ",")) %>%
  filter(number_of_birdlife_species > 1)
```

* *Patagioenas fasciata*, will need the range of Patagioenas fasciata and Patagioenas albilinea combined.

* *Spilopelia chinensis*, will need the range of Spilopelia chinensis, Spilopelia suratensis, and Patagioenas albipennis combined.

Check whether we need to map eBird species names to Birdlife
```{r}
taxonmomic_mapping[taxonmomic_mapping$ebird_species_name %in% species_available,] %>%
  filter(ebird_species_name != species_name) 
```

Species with 1 Birdlife taxa, and multiple eBird taxa will, would mean that we would have cities with no records but potentially have presence test data. We should avoid these species.
```{r}
taxonmomic_mapping[taxonmomic_mapping$ebird_species_name %in% species_available,] %>%
  group_by(species_name) %>% 
  summarise(number_of_ebird_species = n(), ebird_species_names = paste(ebird_species_name, collapse = ",")) %>%
  filter(number_of_ebird_species > 1)
```

```{r}
test_species = species_available
```

# Create regional pool data 
```{r}
regional_pool_data_input = read_csv(filename(BIRDLIFE_WORKING_OUTPUT_DIR, 'urban_distributions.csv'))
```
```{r}
taxonmomic_mapping[taxonmomic_mapping$ebird_species_name == 'Spilopelia chinensis',]
```

```{r}
taxonmomic_mapping[taxonmomic_mapping$ebird_species_name == 'Patagioenas fasciata',]
```
Join ranges up 
```{r}
regional_pool_data_input$species_name[regional_pool_data_input$species_name %in% c('Spilopelia suratensis', 'Patagioenas albipennis')] = 'Spilopelia chinensis'
regional_pool_data_input$species_name[regional_pool_data_input$species_name %in% c('Patagioenas albilinea')] = 'Patagioenas fasciata'

regional_pool_data = regional_pool_data_input %>% group_by(species_name, city_id, city_name) %>% summarise() %>% filter(species_name %in% test_species)
regional_pool_data[regional_pool_data$city_name == 'Medellín',]
```

# Download Status and Trends data
```{r}
key = read_file(filename(KEYS_DIR, 'ebirdst.txt'))
set_ebirdst_access_key(key, overwrite = T)
ebirdst_data_dir()
```

```{r}
for (species in test_species) {
  print(paste('fetching', species))
  ebirdst_download_status(
    species = species,
    download_abundance = TRUE,
    download_occurrence = FALSE,
    download_count = FALSE,
    download_ranges = FALSE,
    download_regional = FALSE,
    download_pis = FALSE,
    download_ppms = FALSE,
    download_all = FALSE
  )
}
```

## Calculate abundance for each city
Read in our initial city vectors, we will join these to our checklists
```{r}
initial_city_selection = vect(filename(mkdir(GEO_WORKING_OUTPUT_DIR, 'cities'), 'initial_selection.shp'))
```

```{r}
city_data = data.frame(initial_city_selection)
city_data$ID = as.double(rownames(city_data))
city_data
```

```{r}
abundance_data = data.frame()

for (species in test_species) {
  abd <- load_raster(species, resolution = "3km", product = "abundance", period = "seasonal")
  abd_wgs84 = project(abd, initial_city_selection)
  city_abundance = terra::extract(abd_wgs84, initial_city_selection)
  
  if ("resident" %in% colnames(city_abundance)) {
     result = city_abundance %>% group_by(ID) %>% summarise(
       breeding = NA, 
       nonbreeding = NA, 
       resident = mean(resident), 
       species = species
     )
     abundance_data = rbind(abundance_data, inner_join(result, city_data[,c('ID', 'city_id', 'city_name')]))
  } else {
    result = city_abundance %>% group_by(ID) %>% summarise(
      breeding = mean(breeding), 
      nonbreeding = mean(nonbreeding), 
      resident = NA, 
      species = species
    ) 
    abundance_data = rbind(abundance_data, inner_join(result, city_data[,c('ID', 'city_id', 'city_name')]))
  }
}

abundance_data[abundance_data$city_name == 'Manchester',]
```

```{r}
write_csv(abundance_data, filename(EBIRD_WORKING_OUTPUT_DIR, 'ebird_trends_abundance_test_data.csv'))
abundance_data = read_csv(filename(EBIRD_WORKING_OUTPUT_DIR, 'ebird_trends_abundance_test_data.csv'))
```

```{r}
abundance_data$max_abundance = apply(abundance_data[,c('breeding', 'nonbreeding', 'resident')], 1, max, na.rm = T)
abundance_data$max_abundance[abundance_data$max_abundance == -Inf] = NA
abundance_data[abundance_data$city_name == 'Manchester',]
```
```{r}
abundance_data[abundance_data$city_name == 'Medellín',]
```

Filter to regional pools
```{r}
abundance_data_regional = left_join(regional_pool_data, abundance_data, by = c('city_id' = 'city_id', 'city_name' = 'city_name', 'species_name' = 'species'))
abundance_data_regional[abundance_data_regional$city_name == 'Manchester',]
```

```{r}
abundance_data_regional[abundance_data_regional$city_name == 'Medellín',]
```

Are any species now missing that are abundant?
```{r}
city_id_species_pairs_present = paste(abundance_data_regional$species_name, abundance_data_regional$city_id, sep = '::')
abundance_data %>% 
  filter(max_abundance > 0) %>% 
  group_by(city_id, species) %>% 
  filter(!(paste(species, city_id, sep = '::') %in% city_id_species_pairs_present))
```

# Create city to pigeon data

Get number of checklists and total effort in each city
```{r}
urban_checklists = read_csv(filename(EBIRD_WORKING_OUTPUT_DIR, 'urban_checklists.csv'))

urban_summary = urban_checklists %>% group_by(city_id, city_name) %>% summarise(total_city_checklists = n(), total_city_effort = sum(duration))

urban_summary
```


For each of our test species, find the number of checklists and total effort in each city
```{r}
urban_columbidae_records_raw = read_csv(filename(EBIRD_WORKING_OUTPUT_DIR, 'urban_columbidae.csv'))
urban_columbidae_records_raw[urban_columbidae_records_raw$city_name == 'Manchester',]
```

```{r}
urban_columbidae_records_with_checklist_data = left_join(urban_columbidae_records_raw, taxonmomic_mapping, by = c('species_name' = 'ebird_species_name'))
urban_columbidae_records_with_checklist_data[urban_columbidae_records_with_checklist_data$city_name == 'Manchester', ]
```

```{r}
urban_columbidae_summary = urban_columbidae_records_with_checklist_data %>% 
  filter(species_name.y %in% test_species) %>%
  group_by(city_id, city_name, species_name.y) %>% 
  summarise(total_presence_checklists = n(), total_presence_effort = sum(duration)) %>%
  arrange(city_id, species_name.y)

names(urban_columbidae_summary) = c('city_id', 'city_name', 'species_name', 'total_presence_checklists', 'total_presence_effort')
urban_columbidae_summary[urban_columbidae_summary$city_name == 'Manchester',]
```

Join the total city data to each pigeon record
```{r}
urban_columbidae_records = left_join(urban_columbidae_summary, urban_summary)
head(urban_columbidae_records)
```

```{r}
urban_columbidae_records[urban_columbidae_records$city_name == 'Manchester',]
```

We also want records for each city that could have that pigeon (e.g. in regional pool range) but has no eBird records
```{r}
urban_columbidae_absence_records_part1 = left_join(regional_pool_data, urban_summary)

urban_columbidae_absence_records_part1$total_presence_checklists = 0
urban_columbidae_absence_records_part1$total_presence_effort = 0

urban_columbidae_absence_records = urban_columbidae_absence_records_part1[!is.na(urban_columbidae_absence_records_part1$total_city_checklists),]
urban_columbidae_absence_records
```

```{r}
all_columbidae_around_cities = rbind(urban_columbidae_records, urban_columbidae_absence_records) %>%
  group_by(city_id, city_name, species_name, total_city_checklists, total_city_effort) %>%
  summarise(total_presence_checklists = sum(total_presence_checklists), total_presence_effort = sum(total_presence_effort)) %>%
  arrange(city_id, species_name)

all_columbidae_around_cities$checklist_ratio = all_columbidae_around_cities$total_presence_checklists / all_columbidae_around_cities$total_city_checklists
all_columbidae_around_cities$effort_ratio = all_columbidae_around_cities$total_presence_effort / all_columbidae_around_cities$total_city_effort

all_columbidae_around_cities
```

Finally join our our test data result to show predicted abundance, remove any NA abundance as there is insufficent data in these places for a prediction.
```{r}
columbidae = left_join(all_columbidae_around_cities, abundance_data, by = c('species_name' = 'species', 'city_id' = 'city_id', 'city_name' = 'city_name'))
columbidae = columbidae[!is.na(columbidae$max_abundance),c('city_id', 'city_name', 'species_name', 'total_city_checklists', 'total_presence_checklists', 'checklist_ratio', 'max_abundance')]
columbidae$expected_present = columbidae$max_abundance > 0.1
columbidae
```

# Plot to see Expected test abundance vs checklists
```{r}
ggplot(columbidae[columbidae$species_name == 'Columba palumbus',], aes(y = checklist_ratio, x = max_abundance, color = log(total_city_checklists))) + geom_jitter() + geom_jitter() + scale_colour_gradient2(
  low = "red",
  mid = "yellow",
  high = "darkgreen",
  midpoint = log(100),
  space = "Lab",
  na.value = "grey50",
  guide = "colourbar",
  aesthetics = "colour",
)
```

n = ((Z * Z) * p * (1−p)) / (E * E)

(e.g., 99% corresponds to Z≈2.576Z≈2.576).
 
So for present = more than 5% of checklists.

p is the estimated probability of success (the proportion of checklists where the species is present).
E is the margin of error (half the width of the confidence interval).

```{r}
(2.576^2 * 0.95 * 0.05) / 0.01^2
```

```{r}
urban_summary[urban_summary$total_city_checklists > 3152,]
```

```{r}
plot_city_species = function(city_id) {
  city_name = urban_summary$city_name[urban_summary$city_id == city_id]
  total_checklists = urban_summary$total_city_checklists[urban_checklists$city_id == city_id]
  
  ggplot(columbidae[columbidae$city_id == city_id,], aes(x = species_name, y = checklist_ratio * 100, fill = expected_present)) + 
    geom_bar(stat = "identity") + 
    geom_text(aes(label = round(max_abundance, 2))) + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
    xlab('Species name') + ylab('% of Checklist present') + 
    facet_wrap(~city_id) + 
    geom_hline(yintercept=5, linetype='dotted', col = 'red') +
    labs(title = paste(city_name, total_checklists))
}
```

```{r}
plot_city_species(1)
```

```{r}
plot_city_species(10)
```

```{r}
plot_city_species(14)
```

```{r}
plot_city_species(21)
```

```{r}
plot_city_species(155)
```

```{r}
plot_city_species(7932)
```

```{r}
plot_city_species(4328)
```

```{r}
plot_city_species(1919)
```

```{r}
plot_city_species(1308)
```


